How to detect UTF-8-based encoded strings [closed]
Posted by Diego Sendra on Programmers
Published on 2013-06-24T19:59:04Z
A customer asked us to build a VB6 scraper with multi-language support, so we needed to detect UTF-8 encoded strings and decode them before displaying them in the application UI. The need arises from a VB6 limitation: its controls do not natively support UTF-8, unlike .NET, where you can tell a control to expect UTF-8. VB6 natively supports only ISO 8859-1 and/or Windows-1252, so text boxes, dropdowns, list views and other controls cannot be set to expect UTF-8. Undecoded UTF-8 text therefore shows up as sequences such as Ã© (for é) or Ã¨ (for è), which makes a mess of the display.
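To make the symptom concrete, here is a minimal sketch of how such garbled text comes about. It is written in Python rather than VB6 purely so that it can be run anywhere, and the helper name mojibake is hypothetical, not part of the original code.

```python
def mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


print(mojibake("é"))  # Ã©
print(mojibake("è"))  # Ã¨
```

Each accented letter becomes two bytes in UTF-8, and a control that expects Windows-1252 draws each byte as a separate character.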
The function below contains the UTF-8 encoded forms of punctuation marks and symbols used in Spanish, Italian, German, Portuguese, French and other languages, based on an excellent UTF-8 table we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html
Basically, the function checks whether any of the listed UTF-8 encoded sequences, separated by | (pipe), occurs in the passed string. For each entry it tries a substring search first; if that finds nothing, it falls back to a search based on ASCII values. For example, a string like "Societé" (Society in English) makes isUTF8("Societé") return FALSE, while isUTF8("SocietÃ©") returns TRUE, since Ã© is the UTF-8 encoded representation of é as it appears when read as Windows-1252.
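As a point of comparison, and not what isUTF8() does, a stricter check can validate the byte structure instead of matching a fixed list. The Python sketch below (Python only so that it can be run and tested) assumes the garbled string was produced by reading UTF-8 bytes as Windows-1252; looks_like_utf8 is a hypothetical name.

```python
def looks_like_utf8(text: str) -> bool:
    """Return True if text looks like UTF-8 bytes misread as Windows-1252."""
    try:
        raw = text.encode("cp1252")  # recover the original bytes
    except UnicodeEncodeError:
        return False  # contains characters Windows-1252 cannot represent
    if raw.isascii():
        return False  # pure ASCII: nothing to decode
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("SocietÃ©"))   # True
print(looks_like_utf8("Societé"))    # False
print(looks_like_utf8("São Paulo"))  # False
```

One limitation of this sketch: Python's cp1252 codec has no mapping for bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so a garbled string that contains one of them is reported as False.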
Once isUTF8() has told you whether the string is UTF-8 encoded, you can decode it with the DecodeUTF8() function so that it displays properly. We found that function elsewhere some time ago and have included it in this post as well.
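The detect-then-decode flow can also be folded into a single step that attempts the conversion and returns the input unchanged on failure. This is a Python sketch under the same Windows-1252 assumption; repair_mojibake is a hypothetical name, and the code is an alternative to the VB6 functions in this post, not a port of them.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-Windows-1252 garbling, or return text unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Windows-1252, or not valid UTF-8: leave as is.
        return text


print(repair_mojibake("SocietÃ©"))  # Societé
print(repair_mojibake("Societé"))   # Societé
```

Because the strict UTF-8 decoder rejects a lone é byte, text that is already correct passes through untouched.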
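Function isUTF8(ByVal ptstr As String) As Boolean
    Dim tUTFencoded As String
    Dim tUTFencodedaux As Variant   'array returned by Split
    Dim tUTFencodedASCII As String
    Dim ptstrASCII As String
    'each variable needs its own type: "Dim a, b As Integer" leaves a as Variant
    Dim iaux As Long, iaux2 As Long
    Dim ffound As Boolean
    ffound = False
    'builds the pipe-separated character codes of the passed string
    ptstrASCII = ""
    For iaux = 1 To Len(ptstr)
        ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|"
    Next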
    'each continued segment starts with a pipe so entries do not merge across lines
    tUTFencoded = "Ä|Ã…|Ç|É|Ñ|Ö|ÃŒ|á|Ã|â|ä|ã|Ã¥|ç|é|è|ê|ë|Ã|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|â€|°|¢|£|§|•|¶|ß|®|©|â„¢|´|¨|â‰|Æ|Ø|∞|±|≤|≥|Â¥|µ|∂|∑|âˆ|Ï€|∫|ª|º|Ω|æ|ø|¿|¡|¬|√|Æ’|≈|∆|«|»|…|Â|À|Ã|Õ|Å’|Å“|–|—|“|â€|‘|’|÷|â—Š|ÿ|Ÿ|â„|€|‹|›|ï¬|fl|‡|·|‚|„|‰|Â|Ú|Ã|Ë|È|Ã|ÃŽ|Ã|ÃŒ|Ó|Ô||Ã’|Ú|Û|Ù|ı|ˆ|Ëœ|¯|˘|Ë™|Ëš|¸|Ë|Ë›|ˇ" & _
        "|Å|Å¡|¦|²|³|¹|¼|½|¾|Ã|×|Ã|Þ|ð|ý|þ" & _
        "|â‰|∞|≤|≥|∂|∑|âˆ|Ï€|∫|Ω|√|≈|∆|â—Š|â„|ï¬|fl||ı|˘|Ë™|Ëš|Ë|Ë›|ˇ"
    tUTFencodedaux = Split(tUTFencoded, "|")
    If UBound(tUTFencodedaux) > 0 Then
        iaux = 0
        Do While Not ffound And Not iaux > UBound(tUTFencodedaux)
            'skips empty entries: InStr returns 1 for an empty search string
            If Len(tUTFencodedaux(iaux)) > 0 Then
                'binary compare, so that a plain "ã" does not match the entry "Ã"
                If InStr(1, ptstr, tUTFencodedaux(iaux), vbBinaryCompare) > 0 Then
                    ffound = True
                End If
                If Not ffound Then
                    'ASCII numeric search
                    tUTFencodedASCII = ""
                    For iaux2 = 1 To Len(tUTFencodedaux(iaux))
                        'gets ASCII numeric sequence
                        tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|"
                    Next
                    'tUTFencodedASCII = Left(tUTFencodedASCII, Len(tUTFencodedASCII) - 1)
                    'compares numeric sequences
                    If InStr(1, ptstrASCII, tUTFencodedASCII) > 0 Then
                        ffound = True
                    End If
                End If
            End If
            iaux = iaux + 1
        Loop
    End If
    isUTF8 = ffound
End Function
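Function DecodeUTF8(ByVal s As String) As String
    Dim i As Long
    Dim c As Long
    Dim n As Long
    'sentinel so the continuation-byte scan can reach the last character
    s = s & " "
    i = 1
    Do While i <= Len(s)
        c = Asc(Mid(s, i, 1))
        If c And &H80 Then
            'counts the lead byte plus the continuation bytes (10xxxxxx) after it
            n = 1
            Do While i + n < Len(s)
                If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then
                    Exit Do
                End If
                n = n + 1
            Loop
            If n = 2 And ((c And &HE0) = &HC0) Then
                'two-byte sequence: rebuilds the single-byte character code
                c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1)
            Else
                'anything else is replaced by an inverted question mark
                c = 191
            End If
            s = Left(s, i - 1) + Chr(c) + Mid(s, i + n)
        End If
        i = i + 1
    Loop
    'drops the sentinel space appended above
    DecodeUTF8 = Left(s, Len(s) - 1)
End Function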
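The arithmetic in DecodeUTF8, Asc(...) + &H40 * (c And &H1), can be compared with the general rule for two-byte UTF-8 sequences. The Python sketch below (Python only so that it can be run and tested) uses two hypothetical helper names. As far as I can tell, the two agree for lead bytes &HC2 and &HC3, which covers characters up to U+00FF, and differ for other lead bytes.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """General rule for a two-byte sequence 110xxxxx 10yyyyyy."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def vb6_shortcut(lead: int, cont: int) -> int:
    """The arithmetic used in DecodeUTF8: second byte plus 0x40 if lead is odd."""
    return cont + 0x40 * (lead & 0x01)


lead, cont = "é".encode("utf-8")  # bytes C3 A9
print(hex(decode_two_byte(lead, cont)), hex(vb6_shortcut(lead, cont)))  # 0xe9 0xe9

lead, cont = "œ".encode("utf-8")  # bytes C5 93, code point U+0153
print(hex(decode_two_byte(lead, cont)), hex(vb6_shortcut(lead, cont)))  # 0x153 0xd3
```

So a character such as œ would come out as Ó (code 0xD3) rather than as itself, which is worth keeping in mind for languages that use characters beyond U+00FF.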
© Programmers or respective owner